Overview

Dataset statistics

Number of variables: 18
Number of observations: 964
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 129.1 KiB
Average record size in memory: 137.1 B

Variable types

CAT: 11
NUM: 6
BOOL: 1

Reproduction

Analysis started: 2020-07-16 09:49:31.047732
Analysis finished: 2020-07-16 09:49:40.183596
Duration: 9.14 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

current_year has constant value "2016" | Constant
current_time has a high cardinality: 953 distinct values | High cardinality
source_name has a high cardinality: 170 distinct values | High cardinality
destination_name has a high cardinality: 168 distinct values | High cardinality
train_name has a high cardinality: 504 distinct values | High cardinality
current_week is highly correlated with current_date | High correlation
current_date is highly correlated with current_week and 1 other field | High correlation
current_day is highly correlated with current_date | High correlation
current_time is uniformly distributed | Uniform
id_code has unique values | Unique
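
The Constant, Unique and High cardinality warnings can be reproduced from the distinct counts alone. The sketch below is illustrative only: the function name `flag` and the cut-off of 50 distinct values are assumptions chosen to be consistent with the warnings listed above, not values read from config.yaml.

```python
# Distinct counts copied from the Variables section of this report.
N_ROWS = 964
DISTINCT = {
    "id_code": 964,
    "current_date": 24,
    "current_time": 953,
    "source_name": 170,
    "destination_name": 168,
    "train_name": 504,
    "current_year": 1,
}

# Hypothetical cut-off (an assumption, not taken from config.yaml).
CARDINALITY_THRESHOLD = 50


def flag(distinct, n_rows, threshold=CARDINALITY_THRESHOLD):
    """Return a warning label for a categorical column, or None."""
    if distinct == 1:
        return "Constant"
    if distinct == n_rows:
        return "Unique"
    if distinct > threshold:
        return "High cardinality"
    return None


flags = {name: flag(count, N_ROWS) for name, count in DISTINCT.items()}
for name, label in flags.items():
    print(name, label)
```

Under these assumptions, current_date (24 distinct values) gets no cardinality flag, which matches the report.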

Variables

id_code
Categorical

UNIQUE

Distinct count: 964
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
wrdomkpyagubyzx | 1 | 0.1%
bhhpnhuvhirkhin | 1 | 0.1%
tvpkzaytnnyhtuj | 1 | 0.1%
ajfcikhfciubgxk | 1 | 0.1%
ehhcktdhpckymcf | 1 | 0.1%
mdrlwiczxvxhrqx | 1 | 0.1%
xocodwijjeuwvuv | 1 | 0.1%
gghsevkjxkcmxju | 1 | 0.1%
gssykqcbduuwtoq | 1 | 0.1%
znfjfgtmesawnns | 1 | 0.1%
Other values (954) | 954 | 99.0%

Length

Max length: 15
Median length: 15
Mean length: 15
Min length: 15

current_date
Categorical

HIGH CORRELATION

Distinct count: 24
Unique (%): 2.5%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
2016-10-14 | 73 | 7.6%
2016-10-06 | 68 | 7.1%
2016-10-13 | 66 | 6.8%
2016-10-11 | 64 | 6.6%
2016-10-10 | 62 | 6.4%
2016-10-21 | 55 | 5.7%
2016-10-07 | 54 | 5.6%
2016-10-17 | 50 | 5.2%
2016-10-25 | 49 | 5.1%
2016-10-19 | 47 | 4.9%
Other values (14) | 376 | 39.0%

Length

Max length: 10
Median length: 10
Mean length: 10
Min length: 10

current_time
Categorical

HIGH CARDINALITY
UNIFORM

Distinct count: 953
Unique (%): 98.9%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
10:31:36 PM | 2 | 0.2%
08:36:18 AM | 2 | 0.2%
09:36:55 PM | 2 | 0.2%
05:05:44 PM | 2 | 0.2%
06:08:58 PM | 2 | 0.2%
04:23:41 PM | 2 | 0.2%
08:56:07 AM | 2 | 0.2%
04:28:24 PM | 2 | 0.2%
08:18:10 AM | 2 | 0.2%
07:22:08 AM | 2 | 0.2%
Other values (943) | 944 | 97.9%

Length

Max length: 11
Median length: 11
Mean length: 11
Min length: 11

source_name
Categorical

HIGH CARDINALITY

Distinct count: 170
Unique (%): 17.6%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
station$544 | 100 | 10.4%
station$266 | 71 | 7.4%
station$147 | 68 | 7.1%
station$150 | 61 | 6.3%
station$130 | 61 | 6.3%
station$178 | 28 | 2.9%
station$525 | 26 | 2.7%
station$214 | 21 | 2.2%
station$281 | 20 | 2.1%
station$117 | 18 | 1.9%
Other values (160) | 490 | 50.8%

Length

Max length: 11
Median length: 11
Mean length: 10.97406639
Min length: 10

destination_name
Categorical

HIGH CARDINALITY

Distinct count: 168
Unique (%): 17.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
station$130 | 86 | 8.9%
station$150 | 75 | 7.8%
station$147 | 75 | 7.8%
station$544 | 72 | 7.5%
station$266 | 56 | 5.8%
station$185 | 29 | 3.0%
station$178 | 29 | 3.0%
station$525 | 28 | 2.9%
station$214 | 20 | 2.1%
station$177 | 16 | 1.7%
Other values (158) | 478 | 49.6%

Length

Max length: 11
Median length: 11
Mean length: 10.96887967
Min length: 10

train_name
Categorical

HIGH CARDINALITY

Distinct count: 504
Unique (%): 52.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
ICZVZV | 22 | 2.3%
ICWYR | 17 | 1.8%
ICWAT | 16 | 1.7%
SYXUUV | 13 | 1.3%
SSXRTS | 12 | 1.2%
ICXUXZ | 10 | 1.0%
ICWAR | 9 | 0.9%
ICXYAT | 8 | 0.8%
PTWWW | 8 | 0.8%
PTXAV | 7 | 0.7%
Other values (494) | 842 | 87.3%

Length

Max length: 8
Median length: 6
Mean length: 5.623443983
Min length: 3

country_code_source
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
whber | 960 | 99.6%
qwnll | 3 | 0.3%
wsluu | 1 | 0.1%

Length

Max length: 5
Median length: 5
Mean length: 5
Min length: 5

longitude_source
Real number (ℝ≥0)

Distinct count: 170
Unique (%): 17.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.281108537344398
Minimum: 2.6527700000000003
Maximum: 6.133331
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 2.65277
5-th percentile: 3.23107465
Q1: 3.8419555
median: 4.356801
Q3: 4.499323
95-th percentile: 5.50790115
Maximum: 6.133331
Range: 3.480561
Interquartile range (IQR): 0.6573675

Descriptive statistics

Standard deviation: 0.578196479
Coefficient of variation (CV): 0.1350576548
Kurtosis: 0.3915324059
Mean: 4.281108537
Median Absolute Deviation (MAD): 0.2864185
Skewness: 0.09998739626
Sum: 4126.98863
Variance: 0.3343111684

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
3.710675 | 100 | 10.4%
4.715866 | 71 | 7.4%
4.356801 | 68 | 7.1%
4.336531 | 61 | 6.3%
4.360846 | 61 | 6.3%
4.421101 | 28 | 2.9%
3.216726 | 26 | 2.7%
4.482785 | 21 | 2.2%
5.566695 | 20 | 2.1%
4.56936 | 18 | 1.9%
Other values (160) | 490 | 50.8%

Minimum 5 values

Value | Count | Frequency (%)
2.65277 | 1 | 0.1%
2.66994 | 1 | 0.1%
2.868943 | 2 | 0.2%
2.925809 | 3 | 0.3%
2.999286 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
6.133331 | 1 | 0.1%
6.03711 | 1 | 0.1%
5.975381 | 1 | 0.1%
5.854917 | 1 | 0.1%
5.809971 | 2 | 0.2%
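
Several of the longitude_source figures are derived from others in the same section: range is maximum minus minimum, IQR is Q3 minus Q1, CV is standard deviation divided by mean, variance is the squared standard deviation, and mean is sum divided by the number of observations. A minimal check using the rounded values printed in this report (the dictionary names are illustrative):

```python
import math

# Values copied from the longitude_source section of this report.
n = 964
minimum, maximum = 2.65277, 6.133331
q1, q3 = 3.8419555, 4.499323
std, mean, total = 0.578196479, 4.281108537, 4126.98863

# Recompute the derived statistics from the primary ones.
derived = {
    "range": maximum - minimum,
    "iqr": q3 - q1,
    "cv": std / mean,
    "variance": std ** 2,
    "mean": total / n,
}

# The corresponding figures as printed in the report.
reported = {
    "range": 3.480561,
    "iqr": 0.6573675,
    "cv": 0.1350576548,
    "variance": 0.3343111684,
    "mean": 4.281108537,
}

for name in reported:
    agrees = math.isclose(derived[name], reported[name], abs_tol=1e-6)
    print(name, derived[name], reported[name], agrees)
```

The tolerance allows for the rounding already applied to the printed inputs.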
 

latitude_source
Real number (ℝ≥0)

Distinct count: 170
Unique (%): 17.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 50.88968450311203
Minimum: 49.599996
Maximum: 51.925093
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 49.599996
5-th percentile: 50.59978665
Q1: 50.824506
median: 50.88228
Q3: 51.035896
95-th percentile: 51.19923
Maximum: 51.925093
Range: 2.325097
Interquartile range (IQR): 0.21139

Descriptive statistics

Standard deviation: 0.2026782182
Coefficient of variation (CV): 0.00398269748
Kurtosis: 5.135848634
Mean: 50.8896845
Median Absolute Deviation (MAD): 0.1285955
Skewness: -0.5555802221
Sum: 49057.65586
Variance: 0.04107846014

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
51.035896 | 100 | 10.4%
50.88228 | 71 | 7.4%
50.845658 | 68 | 7.1%
50.859663 | 61 | 6.3%
50.835707 | 61 | 6.3%
51.2172 | 28 | 2.9%
51.197226 | 26 | 2.7%
51.017648 | 21 | 2.2%
50.62455 | 20 | 2.1%
50.673667 | 18 | 1.9%
Other values (160) | 490 | 50.8%

Minimum 5 values

Value | Count | Frequency (%)
49.599996 | 1 | 0.1%
49.68053 | 2 | 0.2%
50.202821 | 2 | 0.2%
50.40471 | 9 | 0.9%
50.412171 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
51.925093 | 2 | 0.2%
51.533333 | 1 | 0.1%
51.462767 | 1 | 0.1%
51.364623 | 1 | 0.1%
51.322032 | 1 | 0.1%
 

mean_halt_times_source
Real number (ℝ≥0)

Distinct count: 144
Unique (%): 14.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 271.76839337538695
Minimum: 11.973988439306
Maximum: 686.61560693642
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 11.97398844
5-th percentile: 26.90578035
Q1: 72.32947977
median: 202.1878613
Q3: 351.916185
95-th percentile: 686.6156069
Maximum: 686.6156069
Range: 674.6416185
Interquartile range (IQR): 279.5867052

Descriptive statistics

Standard deviation: 222.6073966
Coefficient of variation (CV): 0.8191070119
Kurtosis: -0.8941779117
Mean: 271.7683934
Median Absolute Deviation (MAD): 148.9479769
Skewness: 0.6802543401
Sum: 261984.7312
Variance: 49554.05304

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
309.0144509 | 100 | 10.4%
351.916185 | 71 | 7.4%
634.1647399 | 68 | 7.1%
686.6156069 | 61 | 6.3%
640.265896 | 61 | 6.3%
467.982659 | 28 | 2.9%
164.4190751 | 26 | 2.7%
306.5231214 | 21 | 2.2%
269.1242775 | 20 | 2.1%
421.6445087 | 18 | 1.9%
Other values (134) | 490 | 50.8%

Minimum 5 values

Value | Count | Frequency (%)
11.97398844 | 2 | 0.2%
16.42774566 | 1 | 0.1%
18.13872832 | 2 | 0.2%
18.21676301 | 1 | 0.1%
18.28323699 | 2 | 0.2%

Maximum 5 values

Value | Count | Frequency (%)
686.6156069 | 61 | 6.3%
640.265896 | 61 | 6.3%
634.1647399 | 68 | 7.1%
467.982659 | 28 | 2.9%
421.6445087 | 18 | 1.9%
 
country_code_destination
Categorical

Distinct count: 4
Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
whber | 952 | 98.8%
qwnll | 6 | 0.6%
wsluu | 4 | 0.4%
aqfre | 2 | 0.2%

Length

Max length: 5
Median length: 5
Mean length: 5
Min length: 5

longitude_destination
Real number (ℝ≥0)

Distinct count: 168
Unique (%): 17.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.277023254771785
Minimum: 2.3553093
Maximum: 6.133331
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 2.3553093
5-th percentile: 3.216726
Q1: 3.942542
median: 4.356801
Q3: 4.482785
95-th percentile: 5.327627
Maximum: 6.133331
Range: 3.7780217
Interquartile range (IQR): 0.540243

Descriptive statistics

Standard deviation: 0.5722349278
Coefficient of variation (CV): 0.1337928025
Kurtosis: 0.6801358516
Mean: 4.277023255
Median Absolute Deviation (MAD): 0.2236695
Skewness: -0.03962129251
Sum: 4123.050418
Variance: 0.3274528126

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
4.360846 | 86 | 8.9%
4.336531 | 75 | 7.8%
4.356801 | 75 | 7.8%
3.710675 | 72 | 7.5%
4.715866 | 56 | 5.8%
4.421101 | 29 | 3.0%
4.432221 | 29 | 3.0%
3.216726 | 28 | 2.9%
4.482785 | 20 | 2.1%
4.482076 | 16 | 1.7%
Other values (158) | 478 | 49.6%

Minimum 5 values

Value | Count | Frequency (%)
2.3553093 | 2 | 0.2%
2.736343 | 1 | 0.1%
2.868943 | 1 | 0.1%
2.925809 | 8 | 0.8%
2.999286 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
6.133331 | 4 | 0.4%
5.854917 | 2 | 0.2%
5.80615 | 1 | 0.1%
5.741581 | 1 | 0.1%
5.683331 | 1 | 0.1%
 

latitude_destination
Real number (ℝ≥0)

Distinct count: 168
Unique (%): 17.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 50.88996912116183
Minimum: 48.8809984
Maximum: 52.379128
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 48.8809984
5-th percentile: 50.570729
Q1: 50.835707
median: 50.859663
Q3: 51.017648
95-th percentile: 51.19923
Maximum: 52.379128
Range: 3.4981296
Interquartile range (IQR): 0.181941

Descriptive statistics

Standard deviation: 0.2401170186
Coefficient of variation (CV): 0.004718356539
Kurtosis: 18.52005554
Mean: 50.88996912
Median Absolute Deviation (MAD): 0.111736
Skewness: -0.7938371832
Sum: 49057.93023
Variance: 0.05765618261

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
50.859663 | 86 | 8.9%
50.845658 | 75 | 7.8%
50.835707 | 75 | 7.8%
51.035896 | 72 | 7.5%
50.88228 | 56 | 5.8%
51.19923 | 29 | 3.0%
51.2172 | 29 | 3.0%
51.197226 | 28 | 2.9%
51.017648 | 20 | 2.1%
50.896456 | 16 | 1.7%
Other values (158) | 478 | 49.6%

Minimum 5 values

Value | Count | Frequency (%)
48.8809984 | 2 | 0.2%
49.599996 | 4 | 0.4%
50.285603 | 1 | 0.1%
50.29105 | 1 | 0.1%
50.376798 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
52.379128 | 4 | 0.4%
52.083329 | 1 | 0.1%
51.312432 | 1 | 0.1%
51.281626 | 1 | 0.1%
51.241021 | 1 | 0.1%
 

mean_halt_times_destination
Real number (ℝ≥0)

Distinct count: 149
Unique (%): 15.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 287.41924004029477
Minimum: 10.28323699422
Maximum: 686.61560693642
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.5 KiB

Quantile statistics

Minimum: 10.28323699
5-th percentile: 29.24566474
Q1: 72.32947977
median: 180.5982659
Q3: 467.982659
95-th percentile: 686.6156069
Maximum: 686.6156069
Range: 676.3323699
Interquartile range (IQR): 395.6531792

Descriptive statistics

Standard deviation: 238.8763177
Coefficient of variation (CV): 0.8311076102
Kurtosis: -1.25318412
Mean: 287.41924
Median Absolute Deviation (MAD): 131.8988439
Skewness: 0.5434287667
Sum: 277072.1474
Variance: 57061.89516

Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
640.265896 | 86 | 8.9%
686.6156069 | 75 | 7.8%
634.1647399 | 75 | 7.8%
309.0144509 | 72 | 7.5%
351.916185 | 56 | 5.8%
467.982659 | 29 | 3.0%
421.6445087 | 29 | 3.0%
164.4190751 | 28 | 2.9%
306.5231214 | 20 | 2.1%
153.1156069 | 16 | 1.7%
Other values (139) | 478 | 49.6%

Minimum 5 values

Value | Count | Frequency (%)
10.28323699 | 3 | 0.3%
17.77456647 | 1 | 0.1%
18.28323699 | 1 | 0.1%
19.43352601 | 1 | 0.1%
19.82369942 | 5 | 0.5%

Maximum 5 values

Value | Count | Frequency (%)
686.6156069 | 75 | 7.8%
640.265896 | 86 | 8.9%
634.1647399 | 75 | 7.8%
467.982659 | 29 | 3.0%
421.6445087 | 29 | 3.0%
 

current_year
Categorical

CONSTANT
REJECTED

Distinct count: 1
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
2016 | 964 | 100.0%

Length

Max length: 4
Median length: 4
Mean length: 4
Min length: 4

current_week
Categorical

HIGH CORRELATION

Distinct count: 4
Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
41 | 348 | 36.1%
42 | 267 | 27.7%
43 | 191 | 19.8%
40 | 158 | 16.4%

Length

Max length: 2
Median length: 2
Mean length: 2
Min length: 2

current_day
Categorical

HIGH CORRELATION

Distinct count: 7
Unique (%): 0.7%
Missing: 0
Missing (%): 0.0%
Memory size: 7.5 KiB

Value | Count | Frequency (%)
Friday | 212 | 22.0%
Thursday | 190 | 19.7%
Tuesday | 158 | 16.4%
Monday | 151 | 15.7%
Wednesday | 129 | 13.4%
Sunday | 62 | 6.4%
Saturday | 62 | 6.4%

Length

Max length: 9
Median length: 7
Mean length: 7.088174274
Min length: 6

is_weekend
Boolean

Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 964.0 B

Value | Count | Frequency (%)
False | 840 | 87.1%
True | 124 | 12.9%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
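
The calculation described above can be sketched in plain Python. This is an illustrative implementation, not the code pandas-profiling runs; the function name `pearson_r` and the sample values are made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of products and squares are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / (spread_x * spread_y)


# Made-up illustrative values, not taken from the dataset.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [2.0, 1.0, 5.0, 6.0, 12.0]
r = pearson_r(xs, ys)
# Shifting and (positively) rescaling either variable leaves r unchanged.
r_rescaled = pearson_r([10 * x - 3 for x in xs], [0.5 * y + 7 for y in ys])
print(r, r_rescaled)
```

The second call illustrates the invariance under changes in location and scale mentioned above.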

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
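
As a rough sketch of that recipe: rank each variable, then apply the covariance-over-standard-deviations formula to the ranks. The helper names below are illustrative, and giving tied values the mean of their ranks is an assumption about tie handling that the text does not spell out.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman_rho(xs, ys):
    """Pearson's r computed on the rank variables of xs and ys."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Made-up values: y = x**3 is monotonic but not linear in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```

On this example the rank-based coefficient reaches its maximum while Pearson's r stays below it, which is the distinction the paragraph above draws.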

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
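
A minimal sketch of the pair-counting formula exactly as stated above. The function name is illustrative, and tied pairs are counted as neither concordant nor discordant, which is an assumption; library implementations may adjust for ties differently.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """(concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both variables move in the same direction
        elif sign < 0:
            discordant += 1  # the variables move in opposite directions
        # sign == 0 means a tie in x or y; it counts toward neither total
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up values: 6 pairs in total, of which only (2, 3) vs (3, 2) is discordant.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

Here 5 concordant pairs and 1 discordant pair out of 6 give τ = 4/6.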

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
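
A sketch of a bias-corrected Cramér's V, following the correction commonly attributed to Bergsma (2013): the squared phi coefficient and the table dimensions are each adjusted by a term that shrinks with the sample size. The function name and the sample data are illustrative, and this is not necessarily the exact code path used to build this report.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias corrections for phi squared and for the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(phi2_corr / denom)


# Made-up labels: identical columns versus a perfectly balanced independent pair.
same = cramers_v_corrected(["a", "b"] * 50, ["a", "b"] * 50)
indep = cramers_v_corrected(["a", "a", "b", "b"] * 25, ["x", "y", "x", "y"] * 25)
print(same, indep)
```

A constant column, such as current_year in this dataset, falls into the `denom <= 0` branch and yields 0.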

Missing values

Sample

First rows

(index) | id_code | current_date | current_time | source_name | destination_name | train_name | country_code_source | longitude_source | latitude_source | mean_halt_times_source | country_code_destination | longitude_destination | latitude_destination | mean_halt_times_destination | current_year | current_week | current_day | is_weekend
0 | mckbezdplrukagl | 2016-10-06 | 01:05:38 AM | station$143 | station$142 | SZAYASZ | whber | 4.243393 | 50.866728 | 39.121387 | whber | 4.273543 | 50.868337 | 39.121387 | 2016 | 40 | Thursday | False
1 | agxwrnbmzbyxsjg | 2016-10-06 | 01:05:56 AM | station$133 | station$147 | ICXYXY | whber | 4.326220 | 50.880833 | 95.676301 | whber | 4.356801 | 50.845658 | 634.164740 | 2016 | 40 | Thursday | False
2 | iqjojyewdyfshtj | 2016-10-06 | 06:11:54 AM | station$632 | station$544 | ICWAT | whber | 3.264549 | 50.824506 | 153.115607 | whber | 3.710675 | 51.035896 | 309.014451 | 2016 | 40 | Thursday | False
3 | hssqexnzirioaag | 2016-10-06 | 07:00:00 AM | station$296 | station$281 | ICWYR | whber | 5.599695 | 50.613152 | 87.130058 | whber | 5.566695 | 50.624550 | 269.124277 | 2016 | 40 | Thursday | False
4 | lublknpfraiznhr | 2016-10-06 | 07:00:09 AM | station$281 | station$266 | ICWYR | whber | 5.566695 | 50.624550 | 269.124277 | whber | 4.715866 | 50.882280 | 351.916185 | 2016 | 40 | Thursday | False
5 | hgqkwjbpavdwmob | 2016-10-06 | 07:00:15 AM | station$266 | station$130 | ICWYR | whber | 4.715866 | 50.882280 | 351.916185 | whber | 4.360846 | 50.859663 | 640.265896 | 2016 | 40 | Thursday | False
6 | tcoajkwstpxkrdx | 2016-10-06 | 07:00:19 AM | station$130 | station$147 | ICWYR | whber | 4.360846 | 50.859663 | 640.265896 | whber | 4.356801 | 50.845658 | 634.164740 | 2016 | 40 | Thursday | False
7 | muqhmlfqyzozvkn | 2016-10-06 | 07:00:25 AM | station$147 | station$150 | ICWYR | whber | 4.356801 | 50.845658 | 634.164740 | whber | 4.336531 | 50.835707 | 686.615607 | 2016 | 40 | Thursday | False
8 | zdwfnxlivjlitzd | 2016-10-06 | 07:04:27 AM | station$296 | station$147 | ICWYR | whber | 5.599695 | 50.613152 | 87.130058 | whber | 4.356801 | 50.845658 | 634.164740 | 2016 | 40 | Thursday | False
9 | wznosynddwsawbv | 2016-10-06 | 07:07:43 AM | station$266 | station$147 | ICVYR | whber | 4.715866 | 50.882280 | 351.916185 | whber | 4.356801 | 50.845658 | 634.164740 | 2016 | 40 | Thursday | False

Last rows

(index) | id_code | current_date | current_time | source_name | destination_name | train_name | country_code_source | longitude_source | latitude_source | mean_halt_times_source | country_code_destination | longitude_destination | latitude_destination | mean_halt_times_destination | current_year | current_week | current_day | is_weekend
954 | rzumrsbcrnuxzag | 2016-10-29 | 07:26:37 AM | station$214 | station$178 | ICRYZR | whber | 4.482785 | 51.017648 | 306.523121 | whber | 4.421101 | 51.217200 | 467.982659 | 2016 | 43 | Saturday | True
955 | pxmhcvnwktxqukn | 2016-10-29 | 08:03:44 AM | station$241 | station$247 | ICYRYR | whber | 5.327627 | 50.930822 | 180.598266 | whber | 5.050031 | 50.993341 | 84.919075 | 2016 | 43 | Saturday | True
956 | cugnfjqcwwqrjhu | 2016-10-29 | 08:55:57 AM | station$272 | station$209 | ICYRYR | whber | 4.824043 | 50.984406 | 123.800578 | whber | 4.708235 | 51.074146 | 53.060694 | 2016 | 43 | Saturday | True
957 | bxhlrxcgiapiaab | 2016-10-29 | 08:56:07 AM | station$200 | station$185 | ICYRYR | whber | 4.560614 | 51.135758 | 154.413295 | whber | 4.432221 | 51.199230 | 421.644509 | 2016 | 43 | Saturday | True
958 | uunreizjxarghpv | 2016-10-29 | 09:14:01 AM | station$200 | station$185 | ICYRYR | whber | 4.560614 | 51.135758 | 154.413295 | whber | 4.432221 | 51.199230 | 421.644509 | 2016 | 43 | Saturday | True
959 | pnfrvyxsejnehwu | 2016-10-29 | 09:14:45 AM | station$544 | station$530 | ICZVXA | whber | 3.710675 | 51.035896 | 309.014451 | whber | 3.447848 | 51.092295 | 78.488439 | 2016 | 43 | Saturday | True
960 | omsilbnrgbvkeak | 2016-10-29 | 10:17:59 AM | station$530 | station$544 | ICZVZA | whber | 3.447848 | 51.092295 | 78.488439 | whber | 3.710675 | 51.035896 | 309.014451 | 2016 | 43 | Saturday | True
961 | vkjvqmaaguaeqde | 2016-10-29 | 10:39:10 AM | station$178 | station$147 | ICRYYW | whber | 4.421101 | 51.217200 | 467.982659 | whber | 4.356801 | 50.845658 | 634.164740 | 2016 | 43 | Saturday | True
962 | iutnjhogthfpymb | 2016-10-29 | 10:59:55 AM | station$147 | station$150 | ICZVXY | whber | 4.356801 | 50.845658 | 634.164740 | whber | 4.336531 | 50.835707 | 686.615607 | 2016 | 43 | Saturday | True
963 | xwqxedeqlnimclu | 2016-10-29 | 11:48:37 AM | station$525 | station$536 | ICZVXW | whber | 3.216726 | 51.197226 | 164.419075 | whber | 3.133864 | 51.312432 | 21.416185 | 2016 | 43 | Saturday | True